HW 5.2

Author

Eric Lim

Published

April 20, 2023

1 Imports

import datashader as ds
import hvplot.pandas
import pandas as pd
import numpy as np

2 Data Transformation

df = pd.read_csv('../../hw5_data/household_power_consumption.txt', sep=';')
print(df.shape)
print(df.dtypes)
(2075259, 9)
Date                      object
Time                      object
Global_active_power       object
Global_reactive_power     object
Voltage                   object
Global_intensity          object
Sub_metering_1            object
Sub_metering_2            object
Sub_metering_3           float64
dtype: object
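
Nearly every measurement column loads as object, which usually means a non-numeric placeholder is mixed in with the numbers. The sketch below assumes the placeholder is '?' (an assumption to confirm against the file) and uses a tiny made-up sample rather than the real data. It shows how the na_values argument of read_csv can parse such columns as floats at load time:

```python
import io
import pandas as pd

# Tiny made-up sample in the same ';'-separated layout (not real data).
# Assumption: missing readings are encoded as '?'.
sample = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "16/12/2006;17:25:00;?;?\n"
    "16/12/2006;17:26:00;5.374;233.290\n"
)

# Treat '?' as NaN so the numeric columns are parsed as float64 directly.
demo = pd.read_csv(sample, sep=';', na_values='?')
print(demo.dtypes)
```

If this holds for the real file, the later pd.to_numeric call becomes unnecessary for the affected columns.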

2.1 Convert Data Types

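# An explicit format skips slow per-row format inference and removes day/month ambiguity.
# Assumption: dates are stored day-first (dd/mm/yyyy); adjust the format if they are not.
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')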
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'], errors='coerce')
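
The datetime conversion is the slow step when pandas has to guess the format. A minimal sketch on made-up values, assuming the dates are day-first (dd/mm/yyyy), shows how an explicit format pins down the interpretation of an ambiguous date such as 01/02/2007:

```python
import pandas as pd

# Made-up sample (not real data). Assumption: dates are day-first (dd/mm/yyyy).
demo = pd.DataFrame({
    'Date': ['01/02/2007', '16/12/2006'],
    'Time': ['00:00:00', '17:24:00'],
})

# With an explicit format, '01/02/2007' is read as 1 February, not 2 January.
demo['DateTime'] = pd.to_datetime(
    demo['Date'] + ' ' + demo['Time'],
    format='%d/%m/%Y %H:%M:%S',
)
print(demo['DateTime'])
```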

2.2 Handle Missing Values in Global Active Power column

print("Before handling NaNs:", df['Global_active_power'].isna().sum())

# With ~2 million rows, dropping the ~26,000 rows with missing values (about 1.25%) seems reasonable
df = df[df['Global_active_power'].notna()]

print("After handling NaNs:", df['Global_active_power'].isna().sum())
Before handling NaNs: 25979
After handling NaNs: 0
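
The 25,979 missing rows are about 1.25% of the 2,075,259 total. A short sketch on a made-up column shows one way to compute that share before dropping, with dropna(subset=...) as an equivalent to the notna() filter above:

```python
import numpy as np
import pandas as pd

# Made-up column (not real data): one missing reading out of four.
demo = pd.DataFrame({'Global_active_power': [4.2, np.nan, 5.4, 3.7]})

# The mean of a boolean mask is the fraction of True values.
missing_share = demo['Global_active_power'].isna().mean()
print(f"Missing share: {missing_share:.2%}")

# Drop only the rows where this specific column is missing.
cleaned = demo.dropna(subset=['Global_active_power'])
print(len(cleaned))
```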

3 Plotting

df.hvplot.scatter(x='DateTime', y='Global_active_power', groupby=[], color='blue', rasterize=True)
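
With rasterize=True, the points are aggregated into a grid and each cell is colored by how many points fall in it. The sketch below illustrates that idea with NumPy's histogram2d on synthetic data. It is a stand-in for the concept, not Datashader's actual implementation, and the grid size and value ranges are arbitrary assumptions:

```python
import numpy as np

# Synthetic points (not real data): x stands in for time, y for power.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=10_000)
y = rng.uniform(0, 5, size=10_000)

# Count how many points land in each cell of a 100 x 50 grid.
# A rasterized scatterplot colors each pixel by a count like this.
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=(100, 50), range=[[0, 10], [0, 5]]
)
print(counts.shape, int(counts.sum()))
```

Darker cells in the real plot therefore mean more readings at that time and power level, not higher power.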

Note: The interpretation below assumes that energy consumption is measured in kWh since no documentation or units were provided.

Energy Consumption from 2006 to 2010

Figure 1: A Datashader-rasterized scatterplot of a single household's energy usage from the end of 2006 to the end of 2010. The color scale shows the number of points that fall in each pixel. Two darker blue bands stand out, one between 0 kWh and 0.4 kWh and the other between 1.2 kWh and 1.6 kWh, which suggests that most of the household's readings fall within these two ranges. The household also sees spikes in usage, shown by the lighter blue areas in the upper part of the plot. There is a noticeable anomaly in August 2008, when usage is minimal (i.e., no readings above ~1 kWh).
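
A visual anomaly like the one in August 2008 can be cross-checked numerically by computing the maximum reading per month and flagging months that never exceed a threshold. The sketch below uses a synthetic daily series, not the real data, with one month deliberately capped. The threshold of 1.0 mirrors the ~1 kWh ceiling read off the plot:

```python
import numpy as np
import pandas as pd

# Synthetic daily series (not real data) covering July to September 2008.
idx = pd.date_range('2008-07-01', '2008-09-30', freq='D')
rng = np.random.default_rng(0)
power = pd.Series(rng.uniform(0.2, 6.0, size=len(idx)), index=idx)

# Deliberately cap August at low values to mimic the anomaly.
aug = power.index.month == 8
power[aug] = rng.uniform(0.2, 1.0, size=aug.sum())

# Maximum reading per calendar month, then flag months that never exceed 1.0.
monthly_max = power.groupby(power.index.to_period('M')).max()
low_months = monthly_max[monthly_max < 1.0]
print(low_months)
```

Applied to the real data, the same grouping on df.set_index('DateTime')['Global_active_power'] would show whether August 2008 is the only month with such a low ceiling.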